Skip to content

Users/kirankk/cosmos refresh reason#5791

Draft
kirankumarkolli wants to merge 5 commits intomsdata/directfrom
users/kirankk/cosmos-refresh-reason
Draft

Users/kirankk/cosmos refresh reason#5791
kirankumarkolli wants to merge 5 commits intomsdata/directfrom
users/kirankk/cosmos-refresh-reason

Conversation

@kirankumarkolli
Copy link
Copy Markdown
Member

Pull Request Template

Description

Please include a summary of the change and which issue is fixed. Include samples if adding new API, and include relevant motivation and context. List any dependencies that are required for this change.

Type of change

Please delete options that are not relevant.

  • [] Bug fix (non-breaking change which fixes an issue)
  • [] New feature (non-breaking change which adds functionality)
  • [] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [] This change requires a documentation update

Closing issues

To automatically close an issue: closes #IssueNumber

kirankumarkolli and others added 2 commits April 18, 2026 11:24
Adds a design-time-bounded enum + wire-value table so every SDK-forced
cache refresh can be attributed to a specific cause via the generic
`x-ms-cosmos-refresh-reason` HTTP header. This is the foundation for
tagging address-cache refreshes (and, in future PRs, PK-range-cache,
collection-routing-map, etc.).

Phase 1 scope:
  * `RefreshReason` internal enum with 23 explicitly-numbered members
    (Groups A-D covering server 410, Gone+substatus, transport-synthesized
    Gone, and non-Gone forced refreshes).
  * `RefreshReasonExtensions`:
      - `WireValues` dictionary (single source of truth for enum -> wire
        string; values flat, <=2 dot-segments).
      - `ToHeaderValue` for serialization.
      - `FromTransportErrorCode` exhaustive switch mapping every
        `TransportErrorCode` to a specific reason; default -> `GoneUnknown`.
  * Header constant `HttpConstants.HttpHeaders.CosmosRefreshReason` in
    `src/direct/` (pending upstream `Microsoft.Azure.Cosmos.Direct`
    parallel PR).
  * Carrier field `DocumentServiceRequestContext.RefreshReason` (default
    `Unspecified`) copied by `Clone()`.
  * 12 unit tests validating enum/wire-value coverage, uniqueness, regex
    conformance, and exhaustive transport-code mapping.

No egress wiring or call-site tagging yet - those land in Phase 2/3.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Plumbs the RefreshReason through the two address-cache force-refresh
egress functions and emits the `x-ms-cosmos-refresh-reason` header
when the reason resolves to non-Unspecified. No call-site tagging yet
(that lands in Phase 3); this commit only adds the wiring + test-only
validator so Phase 3 edits are mechanical.

Changes (Microsoft.Azure.Cosmos/src/Routing/GatewayAddressCache.cs):
  * Static test-only toggle `ValidateRefreshReasonPresence` (default
    `false`; tests set `true`). When enabled, a forced refresh with
    `Unspecified` reason throws `InvalidOperationException` so any
    future untagged force-refresh site is caught automatically.
  * New private helper `EmitRefreshReasonHeader` centralizes reason
    resolution (`explicitReason` precedence over
    `request.RequestContext.RefreshReason`), validator check, and
    header emission.
  * `GetMasterAddressesViaGatewayAsync` + `GetServerAddressesViaGatewayAsync`
    take an optional `RefreshReason explicitReason = Unspecified` and
    call the helper when `forceRefresh` is true.
  * `GetAddressesForRangeIdAsync` takes the same optional parameter
    and forwards it; this is the path used by the on-demand
    unhealthy-URI revalidation site (to be tagged in Phase 3 as
    `ReplicaHealthUnhealthyLongLived`) which does not own the
    originating `DocumentServiceRequest`.

Zero production overhead: validator is a single bool read in the
`forceRefresh` branch, default-off.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
1 pipeline(s) were filtered out due to trigger conditions.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"

Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.

Tag every call site that flips ForceRefreshAddressCache = true (or
equivalent background refresh) with the originating RefreshReason so
the outgoing /addresses request carries x-ms-cosmos-refresh-reason.

Sites tagged:
- GatewayAddressCache suboptimal-server-partition timer -> InsufficientReplicasSuboptimalTimer
- GatewayAddressCache unhealthy-URI background refresh -> ReplicaHealthUnhealthyLongLived (via explicitReason)
- GatewayAddressCache master-suboptimal trigger -> InsufficientReplicasSuboptimalTimer
- ConsistencyWriter barrier request -> InsufficientReplicasQuorum
- QuorumReader barrier request -> InsufficientReplicasQuorum
- StoreReader Gone-retry paths (2) -> classified via new ClassifyPriorGone
  helper walking prior StoreResult exception chain (TransportException.ErrorCode
  for transport-synth 410, else substatus mapping, else GoneServer/GoneUnknown)

New helper:
- RefreshReasonExtensions.ClassifyGoneFromException: centralized exception+substatus
  -> RefreshReason mapping

Tagging preserves upstream-set reasons by only writing when still Unspecified.
Build green (0 warn / 0 err); 12 unit tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
1 pipeline(s) were filtered out due to trigger conditions.

Full plan document (problem statement, design choices, 23-value enum
reference with groupings, file inventory covering this repo and upstream
Direct, test surface, phase status, risks) so the PR carries the design
context inline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
1 pipeline(s) were filtered out due to trigger conditions.

- Factor FromTransportErrorCode into an internal TryMapKnownTransportErrorCode
  helper that returns false for unhandled codes. The public API still falls
  back to GoneUnknown for unknown codes (graceful in production), but the
  exhaustive unit test now uses TryMap directly, so any new upstream
  TransportErrorCode will fail CI instead of silently folding to GoneUnknown.

- Promote GatewayAddressCache.EmitRefreshReasonHeader from private to internal
  so the validator/precedence behavior can be unit-tested without a full
  gateway round-trip.

- Add 7 ClassifyGoneFromException tests covering every SubStatusCodes branch
  (CompletingSplit, CompletingPartitionMigration, NameCacheIsStale,
  PartitionKeyRangeGone) and the default GoneServer fallback.

- Add RefreshReasonEmissionTests with 6 tests for precedence (explicit > ctx),
  null-request path (on-demand unhealthy-URI refresh), default no-op when
  untagged, and the opt-in ValidateRefreshReasonPresence invariant that
  guards future force-refresh call sites from shipping untagged.

Tests: 25/25 Routing/RefreshReason pass; 22/22 GatewayAddressCache pass.
Build: clean (0 warn / 0 err).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines:
1 pipeline(s) were filtered out due to trigger conditions.


## 1. Problem

Every forced address-cache refresh makes the SDK call `GET …/addresses?…` with
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What it does not have is context on what the connect was to (namely which endpoint(s) failed.
Should we from the start, support delimiting the header so we can suffix additional information.
So something like this
x-ms-cosmos-refresh-reason: gone.connect; endpoints=rntbd://cdb-ms-prod-eastus1-be207.documents.azure.com:14464/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant